prov/efa: support hardware counter by jiaxiyan · Pull Request #12114 · ofiwg/libfabric

jiaxiyan · 2026-04-06T20:28:10Z

Implement cntr_open_ext in fi_efa_ops_gda to create hardware completion
counters using ibv_create_comp_cntr from rdma-core.
The application can optionally provide its own memory for the completion and
error counts, enabling zero-copy observation of completion progress by
co-located processes or devices.
Implement fi_ops_cntr operations (read, readerr, add, adderr, set,
seterr) that delegate to the corresponding ibv_*_comp_cntr functions.
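The delegation pattern described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `struct mock_hw_cntr` and the `mock_*` functions are hypothetical stand-ins for the ibv_*_comp_cntr family (whose real signatures are not shown on this page), and `struct sk_cntr_ops` is a simplified table in the spirit of fi_ops_cntr, not the actual libfabric definition.

```c
#include <stdint.h>

/* Hypothetical stand-in for a hardware completion counter object. */
struct mock_hw_cntr {
	uint64_t comp_count;
	uint64_t err_count;
};

/* Hypothetical backend calls standing in for the rdma-core functions. */
static uint64_t mock_read(struct mock_hw_cntr *hw) { return hw->comp_count; }
static uint64_t mock_read_err(struct mock_hw_cntr *hw) { return hw->err_count; }
static int mock_add(struct mock_hw_cntr *hw, uint64_t v) { hw->comp_count += v; return 0; }
static int mock_add_err(struct mock_hw_cntr *hw, uint64_t v) { hw->err_count += v; return 0; }
static int mock_set(struct mock_hw_cntr *hw, uint64_t v) { hw->comp_count = v; return 0; }
static int mock_set_err(struct mock_hw_cntr *hw, uint64_t v) { hw->err_count = v; return 0; }

struct sk_cntr;

/* Simplified ops table (assumption: one slot per operation named above). */
struct sk_cntr_ops {
	uint64_t (*read)(struct sk_cntr *c);
	uint64_t (*readerr)(struct sk_cntr *c);
	int (*add)(struct sk_cntr *c, uint64_t v);
	int (*adderr)(struct sk_cntr *c, uint64_t v);
	int (*set)(struct sk_cntr *c, uint64_t v);
	int (*seterr)(struct sk_cntr *c, uint64_t v);
};

struct sk_cntr {
	const struct sk_cntr_ops *ops;
	struct mock_hw_cntr hw;
};

/* Each op does nothing except forward to the matching backend call. */
static uint64_t hw_read(struct sk_cntr *c) { return mock_read(&c->hw); }
static uint64_t hw_readerr(struct sk_cntr *c) { return mock_read_err(&c->hw); }
static int hw_add(struct sk_cntr *c, uint64_t v) { return mock_add(&c->hw, v); }
static int hw_adderr(struct sk_cntr *c, uint64_t v) { return mock_add_err(&c->hw, v); }
static int hw_set(struct sk_cntr *c, uint64_t v) { return mock_set(&c->hw, v); }
static int hw_seterr(struct sk_cntr *c, uint64_t v) { return mock_set_err(&c->hw, v); }

static const struct sk_cntr_ops hw_cntr_ops = {
	.read = hw_read, .readerr = hw_readerr,
	.add = hw_add, .adderr = hw_adderr,
	.set = hw_set, .seterr = hw_seterr,
};
```

A caller only ever goes through the table, for example `c.ops->add(&c, 3)` followed by `c.ops->read(&c)`, so the same call sites could serve a software counter with a different table.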

jiaxiyan · 2026-05-01T23:34:08Z

@mrgolin Could you review this? Thanks!

talavr-amazon · 2026-05-06T14:51:53Z

 	size_t			qp_table_sz_m1;
 	struct ofi_genlock		qp_table_lock;
 	int				urandom_fd;
+	uint32_t		max_comp_cntr;


nit: a comment here explaining what this field is used for?

talavr-amazon · 2026-05-06T14:52:22Z

 	efa_device->device_caps = 0;
+#endif
+	efa_device->max_comp_cntr = 0;
+#if HAVE_IBV_DEVICE_ATTR_EX_MAX_COMP_CNTR


nit: move to a function? max_comp_counter_set

shijin-aws · 2026-05-06T23:06:11Z

+
+	cntr = container_of(cntr_fid, struct efa_cntr, util_cntr.cntr_fid);
+
+	/* Progress CQ to complete WQE in SQ and RQ */


I know this is for avoiding CQ overrun and for resource management (WQ), but I still think it is undesirable given the goal of a hardware counter read: a cheaper way to get the completion count without involving a heavyweight CQ poll. If we want a way to avoid CQ overrun, that can be a documented requirement for the application, or a separate change that protects against CQ overrun elsewhere. Meanwhile, the efa-direct fabric supports FI_PROGRESS_AUTO, which doesn't require the application to use fi_cntr_read to progress completions. So polling the CQ here seems awkward to me.

cc @amitrad-aws @bwbarrett

Since efa-direct claims FI_RM_DISABLED, resource management is the application's responsibility. NCCL GIN will read the counter value directly from hardware without calling fi_cntr_read, so it needs to poll the CQ separately to reclaim queue resources. I want to remove this internal cq polling from the hardware counter path to make this consistent.

efa-direct claims FI_RM_ENABLED today:

libfabric/prov/efa/src/efa_prov_info.c

Lines 103 to 104 in 0ede325

/* EFA direct path retries indefinitely when Receiver Not Ready (RNR) */

prov_info->domain_attr->resource_mgmt = FI_RM_ENABLED;

I thought we agreed that we do need to poll the CQ in the fi_cntr_read() path, even with HW counters?

I just want to call out that this doesn't make much sense, even if it is the safest approach. Also, as far as I can tell, we still bump the util counters in efa_cq_poll_ibv_cq, because the PR still binds the cntr to util_ep in fi_ep_bind. Then why don't we read from the util cntrs, except for FI_REMOTE_WRITE (where there are no completions on the target side of fi_write), which doesn't even have any hardware limit?

This is not the correct thing to do in all cases, please make sure the team is aligned and switch the implementation to the decided on approach.

a-szegel · 2026-05-07T17:28:04Z

+
+	cntr = container_of(cntr_fid, struct efa_cntr, util_cntr.cntr_fid);
+
+	/* Progress CQ to complete WQE in SQ and RQ */


This is not the correct thing to do in all cases, please make sure the team is aligned and switch the implementation to the decided on approach.

Add per-signal and per-counter EFA endpoints with hardware completion counters that write directly to GPU memory, enabling the GPU kernel to poll signal/counter values without host involvement. Counter memory is allocated via the CUDA VMM API (cuMemCreate with gpuDirectRDMACapable) and exported as a DMA-BUF fd. The NIC writes the counter value directly to GPU HBM via cntr_open_ext with FI_EFA_MEMORY_LOCATION_DMABUF.

Changes:
- m4: Add configure probe for fi_efa_comp_cntr_init_attr
- nccl_ofi_cuda: Add nccl_net_ofi_gpu_vmm_alloc/free using CUDA VMM API for RDMA-capable GPU allocations that support DMA-BUF export
- dev header: Define nccl_ofi_gin_dev_counter_handle struct
- resources: Add gdaki_hw_counter (VMM alloc + DMA-BUF + cntr_open_ext), gdaki_sc_endpoint (EP + two hw counters + QP/CQ + per-peer addressing); refactor gdaki_fi_endpoint into open() + enable() for counter binding
- createContext: Create sc_endpoints when nSignals/nCounters > 0, wire signal_handles/counter_handles into the device handle
- Bump fi_getinfo to FI_VERSION(2, 5) so libfabric populates domain_attr->max_cntr_value from device capabilities
- Teardown: endpoint declared after counters so QP is destroyed before counters (required since counters are attached to QP)

Requires libfabric with hardware counter support (PR ofiwg/libfabric#12114).

Tested on p6b.200.48xlarge: data PASS, signal counter = 1 (PASS), GPU-read signal counter = 1 (PASS), clean teardown.

cntr_cnt in domain_attr is the optimal number of completion counters supported by the domain. According to the man page, it may be a fixed value of the maximum number of counters supported by the underlying hardware, or may be a dynamic value, based on the default attributes of the domain. Set it to the maximum number of counters supported by the EFA device, or leave it as 0 when hardware counters are not supported. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

For efa-direct, set max_cntr_value and max_err_cntr_value via fi_getinfo based on the comp_count_max_value and err_count_max_value from EFA device and user hints. The protocol path cannot use hardware counter because it generates multiple completion events per user operation. For API version < 2.5, default to UINT64_MAX. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
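As a rough illustration of the selection logic in this commit message, the sketch below resolves the value to report from the API version, the device limit, and a user hint. The helper name `resolve_max_cntr_value`, the local `API_VERSION` macro, and the treatment of hints (a nonzero hint above the device limit fails with `-ENODATA`) are assumptions made for the example; the commit message only states that device values and hints are both considered and that API versions below 2.5 default to UINT64_MAX.

```c
#include <errno.h>
#include <stdint.h>

/* Local helper for packing a major/minor version (hypothetical name). */
#define API_VERSION(major, minor) (((uint32_t)(major) << 16) | (uint32_t)(minor))

/*
 * Decide which maximum counter value to report.
 *  - API version older than 2.5: report UINT64_MAX, per the commit message.
 *  - Otherwise: report the device limit. A nonzero hint larger than the
 *    device limit is treated as unsatisfiable (assumption of this sketch).
 */
static int resolve_max_cntr_value(uint32_t api_version, uint64_t device_max,
				  uint64_t hint, uint64_t *out)
{
	if (api_version < API_VERSION(2, 5)) {
		*out = UINT64_MAX;
		return 0;
	}

	if (hint != 0 && hint > device_max)
		return -ENODATA;

	*out = device_max;
	return 0;
}
```

The same function could be called once for the completion count limit and once for the error count limit, each with its own device value.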

cuMemCreate with gpuDirectRDMACapable and exported as a DMA-BUF fd. The NIC writes the counter value directly to GPU HBM via cntr_open_ext with FI_EFA_MEMORY_LOCATION_DMABUF.

Changes:
- m4: Add configure probe for fi_efa_comp_cntr_init_attr; defines HAVE_FI_EFA_COMP_CNTR when libfabric exposes the type. All hardware-counter code paths added by this commit are guarded on this macro.
- nccl_ofi_cuda: Bind 9 cuMem* / cuDeviceGet driver functions and add nccl_net_ofi_gpu_vmm_alloc/free using the CUDA VMM API. The allocation requests gpuDirectRDMACapable so the buffer supports DMA-BUF export, which is what cntr_open_ext requires.
- dev header: Define nccl_ofi_gin_dev_counter_handle (qp, cq, cntr_value pointer, per-peer addressing). Add counter_handles, signal_handles, nCounters, nSignals on nccl_ofi_gin_gdaki_dev_handle. Layout is shared with NCCL's mirror struct in nccl_device/gin/efa_gda/gin_efa_gda_dev.h.
- resources header / cpp: Add gdaki_hw_counter (RAII over the VMM allocation + DMA-BUF fd + cntr_open_ext) and gdaki_sc_endpoint (EP + write_cntr + remote_write_cntr + QP/CQ + per-peer addressing + two device handles, one for the WRITE counter and one for the REMOTE_WRITE counter). Endpoint is declared after the counters so C++ destructs the QP before the counters (binding requirement). gdaki_fi_endpoint::open is split into open() (binds CQ + AV but does not call fi_enable) plus enable(), so callers can bind additional resources (such as counters) before enabling.
- createContext: When nSignals or nCounters is nonzero, allocate max(nSignals, nCounters) sc_endpoints, allgather each one's fi_addr, and populate per-peer addressing. Build GPU-resident arrays of nccl_ofi_gin_dev_counter_handle pointers (d_counter_handles, d_signal_handles), patch each handle's cntr_value to the appropriate counter (FI_WRITE for counters, FI_REMOTE_WRITE for signals), and wire the array pointers into the device handle.
- createContext: Bump fi_getinfo to FI_VERSION(2, 5) on the GIN proxy info path (nccl_ofi_gin_resources.cpp::get_gin_info) and on the rdma init path (nccl_ofi_rdma.cpp::nccl_net_ofi_rdma_init) so libfabric reports the hardware-counter capability.
- createContext: Call ctx->endpoint.enable() explicitly after open() now that gdaki_fi_endpoint::open no longer enables.

Requires libfabric with hardware counter support (ofiwg/libfabric#12114). When the libfabric headers do not expose fi_efa_comp_cntr_init_attr, HAVE_FI_EFA_COMP_CNTR is undefined and the sc_endpoint code path is compiled out; existing data-only Put behavior is unchanged.

Implement hardware counter open/close and fi_ops_cntr operations (read, readerr, add, adderr, set, seterr, wait) that delegate to the corresponding ibv_*_comp_cntr functions from rdma-core. The application is responsible for calling fi_cq_read to prevent CQ overrun. Skip cntr add/adderr in the CQ polling path for hardware counters. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
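The last point, skipping the software add/adderr when the bound counter is a hardware counter, might look roughly like the following. All names here (`struct poll_cntr`, `poll_cq_sketch`, the `status` array standing in for CQ entries) are hypothetical and chosen for the example; this is not the provider's polling code.

```c
#include <stddef.h>
#include <stdint.h>

enum poll_cntr_kind { POLL_CNTR_SOFTWARE, POLL_CNTR_HARDWARE };

struct poll_cntr {
	enum poll_cntr_kind kind;
	uint64_t sw_count;     /* maintained only for software counters */
	uint64_t sw_err_count; /* maintained only for software counters */
};

/*
 * Drain n completion entries. status[i] != 0 marks an error completion.
 * Every entry is consumed, which is what frees queue resources, but the
 * software counts are bumped only when the bound counter is not a
 * hardware counter (the device is assumed to have counted it already).
 * Returns the number of entries consumed.
 */
static size_t poll_cq_sketch(const int *status, size_t n, struct poll_cntr *cntr)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (!cntr || cntr->kind == POLL_CNTR_HARDWARE)
			continue;

		if (status[i])
			cntr->sw_err_count++;
		else
			cntr->sw_count++;
	}

	return n;
}
```

The point of the sketch is that polling the CQ and counting completions become separate concerns: the application still has to drain the CQ, but doing so no longer changes the value it reads from a hardware counter.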

… memory

Add cntr_open_ext to fi_efa_ops_gda to create hardware completion counters with optional application-provided external memory for the completion and error counts, enabling zero-copy observation of completion progress by co-located processes or devices. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
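To make the "optional external memory" idea concrete, here is a minimal sketch in which a counter either allocates its own storage or counts into memory supplied by the caller, so an observer holding that pointer sees progress without any copy. `struct ext_cntr`, `ext_cntr_open`, and `ext_device_complete` are hypothetical names; the real cntr_open_ext attributes are not reproduced here, and the use of C11 atomics is an assumption of the example rather than a statement about how the device updates memory.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct ext_cntr {
	_Atomic uint64_t *comp; /* where completions are counted */
	int owns_mem;           /* 1 if the counter allocated comp itself */
};

/* Open a counter; pass ext_mem == NULL to let the counter allocate. */
static int ext_cntr_open(struct ext_cntr *c, _Atomic uint64_t *ext_mem)
{
	if (ext_mem) {
		c->comp = ext_mem;
		c->owns_mem = 0;
	} else {
		c->comp = calloc(1, sizeof(*c->comp));
		if (!c->comp)
			return -1;
		c->owns_mem = 1;
	}
	atomic_store(c->comp, 0);
	return 0;
}

/* Stands in for the device bumping the count when an operation completes. */
static void ext_device_complete(struct ext_cntr *c)
{
	atomic_fetch_add(c->comp, 1);
}

/* Release internal storage only; external memory belongs to the caller. */
static void ext_cntr_close(struct ext_cntr *c)
{
	if (c->owns_mem)
		free((void *)c->comp);
	c->comp = NULL;
}
```

With external memory, the caller can read the shared location directly (for example with `atomic_load`) instead of going through a read call on the counter object.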

Attach hardware completion counter to QP with ibv_qp_attach_comp_cntr after QP is created in RESET state during ep enable. We cannot do this during ep bind because QP is not created yet. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
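The ordering constraint in this commit (record the counter at bind time, attach it only once the QP exists at enable time) can be sketched as below. The types and functions (`struct sketch_ep`, `sketch_ep_bind_cntr`, `sketch_ep_enable`) are hypothetical and do not call ibv_qp_attach_comp_cntr; rejecting a bind after enable with `-EINVAL` is an assumption of the sketch.

```c
#include <errno.h>
#include <stddef.h>

struct sketch_cntr {
	int attached_qps; /* how many QPs this counter is attached to */
};

struct sketch_qp {
	int in_reset;             /* 1 while the QP is still in RESET */
	struct sketch_cntr *cntr; /* attached counter, if any */
};

struct sketch_ep {
	struct sketch_cntr *bound_cntr; /* recorded at bind time */
	struct sketch_qp *qp;           /* NULL until enable */
	struct sketch_qp qp_storage;
};

/* Bind only records the counter: there is no QP to attach to yet. */
static int sketch_ep_bind_cntr(struct sketch_ep *ep, struct sketch_cntr *cntr)
{
	if (ep->qp)
		return -EINVAL; /* assumption: no binding after enable */
	ep->bound_cntr = cntr;
	return 0;
}

/* Enable creates the QP in RESET, attaches the counter, then moves on. */
static int sketch_ep_enable(struct sketch_ep *ep)
{
	if (ep->qp)
		return -EINVAL;

	ep->qp_storage.in_reset = 1;
	ep->qp_storage.cntr = NULL;
	ep->qp = &ep->qp_storage;

	if (ep->bound_cntr) {
		ep->qp->cntr = ep->bound_cntr;
		ep->bound_cntr->attached_qps++;
	}

	ep->qp->in_reset = 0; /* leave RESET only after attaching */
	return 0;
}
```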

Add efa_hw_cntr_wait(), which polls the hardware completion counter until it reaches the requested threshold or the timeout expires. Uses exponential backoff starting at 1 microsecond, doubling each iteration for up to 5 attempts, or repeats at 1 ms when the user asks for an infinite timeout. Also fix efa_cntr_wait, which did not handle an infinite timeout correctly. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
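One possible reading of this wait loop is sketched below. It is not the PR's efa_hw_cntr_wait(): the function name `sketch_cntr_wait`, the read callback, and the choice to hold the delay at its final value after five doublings are all assumptions, since the commit message does not say what happens after the fifth attempt. A negative `timeout_ms` is taken to mean "wait forever", matching the fi_cntr_wait convention.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime and nanosleep */
#include <errno.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t (*cntr_read_fn)(void *ctx);

/* Example reader: the counter is a plain uint64_t behind ctx. */
static uint64_t read_plain(void *ctx)
{
	return *(uint64_t *)ctx;
}

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void sleep_ns(uint64_t ns)
{
	struct timespec ts = {
		.tv_sec = (time_t)(ns / 1000000000ull),
		.tv_nsec = (long)(ns % 1000000000ull),
	};

	nanosleep(&ts, NULL);
}

/* Returns 0 once the counter reaches threshold, -ETIMEDOUT on timeout. */
static int sketch_cntr_wait(cntr_read_fn read_fn, void *ctx,
			    uint64_t threshold, int timeout_ms)
{
	uint64_t deadline = 0;
	uint64_t delay_ns = 1000; /* start at 1 microsecond */
	int doublings = 0;

	if (timeout_ms >= 0)
		deadline = now_ns() + (uint64_t)timeout_ms * 1000000ull;

	for (;;) {
		if (read_fn(ctx) >= threshold)
			return 0;

		if (timeout_ms < 0) {
			sleep_ns(1000000); /* infinite wait: poll every 1 ms */
			continue;
		}

		if (now_ns() >= deadline)
			return -ETIMEDOUT;

		sleep_ns(delay_ns);
		if (doublings < 5) { /* double at most five times */
			delay_ns *= 2;
			doublings++;
		}
	}
}
```

The counter is always checked before the deadline, so a counter that has already reached the threshold returns 0 even with a zero timeout.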

Add fi_efa_hw_cntr fabtest that exercises hardware counters through MSG pingpong operations. The test opens counters via cntr_open_ext from the GDA domain ops, binds them as txcntr/rxcntr, and uses the existing ft_get_cntr_comp path for completion tracking. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

Add RMA write support to fi_efa_hw_cntr via the -o write option. This adds rma_write() and run_rma() functions, and the API_OPTS parsing to select between MSG pingpong (default) and RMA write. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

Add --external-mem flag to fi_efa_hw_cntr that enables external user-provided memory mode. When set, the test allocates buffers and passes them via FI_EFA_MEMORY_LOCATION_VA with the FI_EFA_COMP_CNTR_INIT_WITH_EXTERNAL_MEM flag to cntr_open_ext. Add corresponding pytest cases for pingpong and RMA write with external memory. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

Hardware counter requires firmware support. Add environment variable FI_EFA_USE_HW_CNTR that is not registered via fi_param_define so we can control when to enable it without exposing the variable to applications. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

Guard all the entry points to hardware counter with FI_EFA_USE_HW_CNTR, which defaults to false until we enable it. Enable fabtests and unit tests with FI_EFA_USE_HW_CNTR=1. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>
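A guard of this kind, read straight from the environment rather than registered through fi_param_define, could be sketched as follows. The helper names and the rule that only the exact string "1" enables the feature are assumptions for illustration; the commits only state that the variable exists, defaults to false, and is set to 1 in tests. Returning `-ENOSYS` from the guarded entry point is likewise a placeholder.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Returns 1 when the hidden switch is on, 0 otherwise (default off). */
static int sketch_use_hw_cntr(void)
{
	const char *val = getenv("FI_EFA_USE_HW_CNTR");

	/* Assumption of this sketch: only the exact string "1" enables it. */
	return val != NULL && strcmp(val, "1") == 0;
}

/* Example of guarding an entry point behind the switch. */
static int sketch_hw_cntr_open(void)
{
	if (!sketch_use_hw_cntr())
		return -ENOSYS; /* placeholder error for "feature disabled" */

	/* ... the real open path would go here ... */
	return 0;
}
```

Because the variable is read with `getenv` directly, it does not appear in the provider's registered parameter list, which is the effect the commit message describes.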

The EFA device does not support FI_SELECTIVE_COMPLETION and efa-direct is a hardware offloading component, so we shouldn't implement FI_SELECTIVE_COMPLETION in software. Specifically, when applications use hardware counters, we are unable to support FI_SELECTIVE_COMPLETION because the device requires the CQ to be polled to avoid CQ overrun and reclaim wr id. Signed-off-by: Jessie Yang <jiaxiyan@amazon.com>

jiaxiyan marked this pull request as draft April 6, 2026 20:28

jiaxiyan force-pushed the hw-cntr-support branch 2 times, most recently from 63d3a4a to 4e231eb Compare April 7, 2026 19:21

jiaxiyan force-pushed the hw-cntr-support branch from 4e231eb to 3c6c1e2 Compare April 14, 2026 20:10

talavr-amazon reviewed Apr 23, 2026


Comment thread prov/efa/src/efa_cntr.h Outdated

Comment thread fabtests/prov/efa/src/efa_hw_cntr_test.c

jiaxiyan force-pushed the hw-cntr-support branch from 3c6c1e2 to 37a80cc Compare April 24, 2026 22:33

shijin-aws reviewed Apr 27, 2026


Comment thread prov/efa/src/efa_cntr.c Outdated

Comment thread prov/efa/src/efa_hw_cntr.c Outdated

Comment thread man/fi_efa.7.md

Comment thread prov/efa/src/efa_base_ep.c

Comment thread fabtests/prov/efa/src/efa_hw_cntr_test.c Outdated

jiaxiyan force-pushed the hw-cntr-support branch 9 times, most recently from 0c7d981 to b0a5ead Compare May 1, 2026 18:47

jiaxiyan marked this pull request as ready for review May 1, 2026 18:47

jiaxiyan changed the title ~~prov/efa: Extend GDA domain ops to support hardware counter~~ prov/efa: support hardware counter May 1, 2026

jiaxiyan requested a review from a team May 1, 2026 18:47

shijin-aws reviewed May 4, 2026


Comment thread prov/efa/src/efa_prov_info.c

Comment thread prov/efa/src/efa_hw_cntr.c Outdated

Comment thread prov/efa/src/fi_ext_efa.h

Comment thread prov/efa/src/fi_ext_efa.h

jiaxiyan force-pushed the hw-cntr-support branch 2 times, most recently from 501a4be to a2dc8f3 Compare May 5, 2026 21:43

talavr-amazon previously approved these changes May 6, 2026


jiaxiyan dismissed talavr-amazon’s stale review via 93cd797 May 6, 2026 21:24

jiaxiyan force-pushed the hw-cntr-support branch from a2dc8f3 to 93cd797 Compare May 6, 2026 21:24

shijin-aws reviewed May 6, 2026


a-szegel requested changes May 7, 2026


jiaxiyan force-pushed the hw-cntr-support branch 2 times, most recently from 9aa9bff to cdaeb21 Compare May 11, 2026 20:31

jiaxiyan added 2 commits May 18, 2026 09:58

jiaxiyan force-pushed the hw-cntr-support branch from cdaeb21 to 49e2b5f Compare May 18, 2026 18:24

jiaxiyan force-pushed the hw-cntr-support branch 2 times, most recently from 30f8e65 to 68ae03b Compare May 18, 2026 21:12

jiaxiyan added 10 commits May 18, 2026 14:13

prov/efa: Attach hardware completion counter to QP

94f4e60


prov/efa: Guard hardware counter with FI_EFA_USE_HW_CNTR

6189b70


jiaxiyan force-pushed the hw-cntr-support branch from 68ae03b to 5f73294 Compare May 18, 2026 21:13




4 participants
